Prospector Heads: Generalized Feature Attribution for Large Models & Data

Feature attribution, the ability to localize regions of the input data that are relevant for classification, is an important capability for ML models in scientific and biomedical domains. Current methods for feature attribution, which rely on “explaining” the predictions of end-to-end classifiers, suffer from imprecise feature localization and are inadequate for use with small sample sizes and high-dimensional datasets due to computational challenges. We introduce prospector heads, an efficient and interpretable alternative to explanation-based attribution methods that can be applied to any encoder and any data modality. Prospector heads generalize across modalities through experiments on sequences (text), images (pathology), and graphs (protein structures), outperforming baseline attribution methods by up to 26.3 points in mean localization AUPRC. We also demonstrate how prospector heads enable improved interpretation and discovery of class-specific patterns in input data. Through their high performance, flexibility, and generalizability, prospectors provide a framework for improving trust and transparency for ML models in complex domains.


Introduction
Most ML models are optimized solely for predictive performance, but many applications also necessitate models that provide insight into features of the data that are unique to a particular class. This capability is known as feature attribution, which in unstructured data (e.g., text, images, graphs) consists of identifying subsets of the input datum most responsible for that datum's class membership (e.g., pixels or patches of an image, often represented as a heatmap). Feature attribution is especially important for scientific and biomedical applications. For example, for a model to assist a pathologist in making a cancer diagnosis, it ideally should not only accurately classify which images contain tumors, but also precisely locate the tumors in each image (Song et al., 2023; Niazi et al., 2019).
* Joint senior authorship. 1 Department of Biomedical Data Science, Stanford University. 2 Cartesia AI.
Unfortunately, modern ML systems can struggle to perform feature attribution. Most existing attribution techniques attempt to provide "explanations" for trained classifiers (Figure 1): descriptions of how model weights interact with different input features (e.g., gradients (Simonyan & Zisserman, 2014), attention (Jetley et al., 2018)) or of how each feature contributes to prediction (e.g., SHAP (Lundberg & Lee, 2017), LIME (Ribeiro et al., 2016)). Explanation-based attribution methods are inherently (a) data-inefficient, as they require ample labeled training data to train underlying classifiers. Additionally, methods producing explanations can themselves be (b) computationally inefficient (Ancona et al., 2019; Chen et al., 2023) and thus may not actually improve tractability relative to annotation by domain experts, particularly for large inputs. Finally, (c) the attributed features are often found to be inaccurate and irrelevant to target classes (Arun et al., 2020; Zech et al., 2018; Jain & Wallace, 2019; Zhou et al., 2021b; Bilodeau et al., 2022).
We explore whether foundation models (FMs) can be used to solve challenges (a-c) without traditional explanations. Prior work demonstrates that FMs learn high-quality data representations and can learn class-specific properties through a few labeled examples (Bommasani et al., 2021; Brown et al., 2020; Gondal et al., 2023). However, it is unclear whether FM representations can be used to perform feature attribution in a scalable and accurate manner. Our key insight is to build on top of FM representations, rather than explain an FM fine-tuned as an end-to-end classifier.
In this work we present prospector heads (a.k.a. "prospectors"), simple modules that aim to equip any encoder, including FMs, with feature attribution, just as one would equip classification heads. Prospectors inductively reason over two layers: layer (I) categorizes learned representations into a finite set of "concepts," and layer (II) learns concepts' spatial associations and how those associations correlate with a target class. To (a) enable data efficiency, prospectors are parameter-efficient, with only hundreds of parameters. To (b) limit time complexity, prospectors operate with efficient data structures and linear-time convolutions, all without model backpropagation. To (c) improve attribution accuracy, prospector heads are explicitly trained to perform feature attribution, unlike explanation methods.
We show that prospector heads outperform attribution baselines over multiple challenging data modalities. Prospector-equipped models achieve gains in mean area under the precision-recall curve (AUPRC) of 8.5 points in sequences (text documents), 26.3 points in images (pathology slides), and 22.0 points in graphs (protein structures) over the top modality-specific baselines. Additionally, we show that prospector-equipped FMs are particularly robust to variation in the prevalence and dispersion of class-specific features. Finally, we also present visualizations of prospectors' internals and outputs to demonstrate their interpretability in complex domain applications.

Related Work
To adequately motivate our approach (Section 3), this section focuses on the central methodological ideas. We present a full version of Related Work, including baselines, in Appendix A.
Modern encoders & context sizes: Most modern encoders for unstructured data operate on tokens, or relatively small pieces of a datum, and their representations. Tokens can be user-prespecified and/or constructed by the encoder itself (potentially with help from a tokenizer), where these encoders are respectively referred to as partial-context and full-context (Figure 2). Due to computational constraints, high-dimensional unstructured data (e.g., gigapixel images) often require user-prespecified tokens (i.e., patches) and partial-context encoders that embed each token (Lu et al., 2023; Huang et al., 2023; Klemmer et al., 2023; Lanusse et al., 2023).
Gradient-based saliency and attention maps have been used to explain partial-context classifiers for high-dimensional unstructured data like gigapixel imagery (Campanella et al., 2019; Chen et al., 2022c). However, studies report low specificity and sensitivity (Machiraju et al., 2022), in part because attribution for the entire datum is built by concatenating attributions across prespecified tokens. Partial-context strategies incorrectly assume prespecified tokens are independent and identically distributed (IID).

Concept-based modeling: The use of concepts in ML inherently increases model interpretability by forcing models to reason over unstructured data with respect to said concepts. Concepts themselves can be human-derived, machine-derived (Ghorbani et al., 2019; Talukder et al., 2024), or co-derived with humans in the loop (Lam et al., 2024).

Prospection: Attribution sans Explanation
Prospectors are designed to perform few-shot feature attribution for high-dimensional data while meeting challenges (a-c). Instead of explaining end-to-end classifiers, prospectors interface with encoders by adapting their token embeddings. Crucially, prospectors foster a form of inductive reasoning over token embeddings to learn class-specific patterns. The use of tokens as the core unit of analysis depends on the key assumption that the equipped encoders have learned adequate distributional semantics in large-scale pretraining. Prospectors can then learn class-specific patterns in small labeled datasets via a simple two-layer module. In layer (I), prospectors transform token embeddings into spatially resolved concepts learned from the training set, constructing a parsimonious "vocabulary" or "codebook" that can be user-verified and/or user-defined. Layer (II) then attributes scores to each token using a novel form of graph convolution that operates on concept frequencies and co-occurrences. The following sections describe the inference and fitting procedures of each layer.

Preliminaries
To enable any encoder to perform feature attribution regardless of input modality, we first define a generalized language for unstructured data. Any unstructured datum can be represented by a map graph $G(V, E)$ where each vertex $v \in V$ represents a discrete token, or piece of that datum in Euclidean space (Definition C.2). $G$ is composed of $T = |V|$ tokens. For example, in image data, tokens can be defined as pixels or patches. An edge $e_{i \leftrightarrow j} \in E$ connects vertex $v_i$ to $v_j$. Both $G$'s token resolution and token connectivity are defined based on data modality (Figure S1).
Problem setup: Suppose we have a dataset containing map graphs $G$ and binary class labels $y$. We assume that a class-$y$ graph $G(V, E)$ contains a set of class-$y$-specific vertices $V_y \subseteq V$, with $|V_y| \geq 1$ (Zhou et al., 2016). One main goal of feature attribution is to locate $V_y$ in each datum given a set of $(G, y)$ pairs as a training dataset. This task is inherently coarsely supervised (Robinson et al., 2020) and is discussed further in Appendix A.
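As an illustration of the map graph abstraction, the following is a minimal Python sketch, not the authors' implementation. It stores a map graph as an adjacency dictionary and builds two hypothetical connectivities: a 1D sequence in which tokens within `hops` positions are linked (one possible reading of "2-hop connectivity") and a 4-connected grid of image patches. The function names, the adjacency-dictionary layout, and both connectivity choices are assumptions made for illustration.

```python
def sequence_map_graph(num_tokens, hops=2):
    """Adjacency sets for a 1D token sequence.

    Assumption: tokens at most `hops` positions apart share an edge.
    """
    adj = {i: set() for i in range(num_tokens)}
    for i in range(num_tokens):
        for j in range(i + 1, min(i + hops, num_tokens - 1) + 1):
            adj[i].add(j)
            adj[j].add(i)
    return adj


def grid_map_graph(rows, cols):
    """Adjacency sets for a 2D grid of patches (vertex id = r * cols + c).

    Assumption: 4-connectivity (left/right/up/down neighbors only).
    """
    adj = {r * cols + c: set() for r in range(rows) for c in range(cols)}
    for r in range(rows):
        for c in range(cols):
            v = r * cols + c
            if c + 1 < cols:  # horizontal edge
                adj[v].add(v + 1)
                adj[v + 1].add(v)
            if r + 1 < rows:  # vertical edge
                adj[v].add(v + cols)
                adj[v + cols].add(v)
    return adj
```

Vertex features (embeddings, concepts, or attribution scores) can then be kept in separate dictionaries keyed by vertex id, so the same topology is shared across representations of a datum.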

RECEIVING TOKEN EMBEDDINGS
Prospectors receive token embeddings $x_1, \ldots, x_T$ from an equipped encoder and update map graph $G$ such that each vertex $v_i \in V$ is featurized by an embedding $x_i \in \mathbb{R}^d$. This vertex-specific "feature loading" uses the notation $G[v_i] := x_i$. Details for partial- and full-context encoders are specified in Appendix C.1.

LAYER I: QUANTIZING EMBEDDINGS
Next, prospectors use an encoder's learned semantics to define $K$ spatially resolved concepts $C = \{1, \ldots, K\}$. This is achieved by quantizing each token embedding $x \in \mathbb{R}^d$ as a scalar concept $c \in C$:

$$\mathrm{quantize}(x) = c.$$

When the quantize layer (Section 3.4.1) is applied over the full graph $G$, $G$ is transformed into a graph $S$ with the same topology as $G$, but with categorical vertex features $S[v_i] := c_i \; \forall i$. We refer to $S$ as a data sprite due to its low feature dimensionality compared to $G$ (a data compression ratio of $d$). Intuitively, the heterogeneity of $S$ is parameterized by the choice of $K$. This layer is depicted in Figure 4.
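The quantization step can be sketched in a few lines of pure Python. This is a minimal illustration under stated assumptions, not the authors' code: centroids are fit with a basic K-means loop, each embedding is mapped to the index of its nearest centroid, and concept ids run from 0 to K-1 rather than 1 to K. The helper names (`fit_kmeans`, `quantize`, `make_sprite`) and the fixed iteration count are hypothetical.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(x, centroids):
    """Map embedding x to the index of its nearest centroid (its concept id)."""
    return min(range(len(centroids)), key=lambda c: sq_dist(x, centroids[c]))


def fit_kmeans(embeddings, k, iters=20, seed=0):
    """Basic K-means over a list of embeddings; returns k centroids."""
    rng = random.Random(seed)
    centroids = [list(e) for e in rng.sample(embeddings, k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for e in embeddings:
            clusters[quantize(e, centroids)].append(e)
        for c, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


def make_sprite(token_embeddings, centroids):
    """Turn {vertex: embedding} into a data sprite {vertex: concept id}."""
    return {v: quantize(x, centroids) for v, x in token_embeddings.items()}
```

In practice a library clustering routine would replace the hand-written loop; the point of the sketch is only that the sprite keeps the graph's vertices and swaps each d-dimensional feature for a single categorical value.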

LAYER II: CONVOLUTION OVER CONCEPTS
Prospectors next perform feature attribution using a form of graph convolution over sprite $S$. This convolution requires a global kernel $\omega$ that computes an attribution score $a \in \mathbb{R}$ for each vertex $v$ based on the concepts $c_i$ (i.e., monograms) and co-occurrences $(c_i, c_j)$ (i.e., skip-bigrams) present within the graph neighborhood defined by receptive field $r$. The kernel $\omega$ serves as a form of associative memory and can be conceptualized as a dictionary, scoring each concept monogram or skip-bigram in the combinatorial space $Z = C \cup \{C \otimes C\}$, where $\otimes$ is the Cartesian product. The global kernel is fit over the training set (Section 3.4.2).
To perform feature attribution at inference time, we apply the fitted kernel over each vertex in a datum to produce a prospect map $P$. $P$ is a map graph with the same topology as $G$ and $S$ but featurized by scalar continuous attribution scores $P[v] := a$. We call this layer K2conv in reference to kernel $\omega$'s implicit structure (Definition C.5). An attribution score $a_i$ is computed for each vertex $v_i$ in $S$, where $N_r(v_i)$ represents all vertices within the $r$-neighborhood of $v_i$ (including $v_i$ itself):

$$a_i = \sum_{v_j \in N_r(v_i)} \omega\langle c_j \rangle + \sum_{v_j, v_k \in N_r(v_i)} \omega\langle c_j, c_k \rangle,$$

where $\omega\langle \cdot \rangle$ denotes dictionary lookup. The above expression resembles the energy function for 2D Markov random fields, but adjusted to allow for longer-range dependencies in the second term via skip-bigrams. The resulting prospect map $P$ targets class-specific region $V_y$ by assigning high absolute positive or negative values to each token. Intuitively, $r$ parameterizes the level of smoothing over $P$ by modulating the number of neighboring tokens used to compute a token's importance. This layer is depicted in Figure 5.
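The scoring described above, a sum of kernel lookups over the monograms and skip-bigrams found in each vertex's r-neighborhood, can be sketched as follows. This is an illustrative sketch rather than the authors' K2conv implementation. It assumes integer concept ids, a kernel stored as a plain dictionary keyed by a concept (monogram) or a sorted pair of concepts (skip-bigram), unordered pairs counted once per pair of distinct vertices, and a weight of zero for entries absent from the kernel. The names `r_neighborhood` and `k2conv` are hypothetical.

```python
from collections import deque
from itertools import combinations


def r_neighborhood(adj, v, r):
    """Vertices within r hops of v (including v), via breadth-first search."""
    seen = {v}
    frontier = deque([(v, 0)])
    while frontier:
        u, depth = frontier.popleft()
        if depth == r:
            continue
        for w in adj[u]:
            if w not in seen:
                seen.add(w)
                frontier.append((w, depth + 1))
    return seen


def k2conv(adj, sprite, kernel, r):
    """Return a prospect map {vertex: score} for sprite {vertex: concept}."""
    prospect = {}
    for v in adj:
        hood = sorted(r_neighborhood(adj, v, r))
        # Monogram term: one lookup per vertex in the neighborhood.
        score = sum(kernel.get(sprite[u], 0.0) for u in hood)
        # Skip-bigram term: one lookup per unordered pair of vertices.
        for u, w in combinations(hood, 2):
            pair = tuple(sorted((sprite[u], sprite[w])))
            score += kernel.get(pair, 0.0)
        prospect[v] = score
    return prospect
```

For example, on a three-vertex path with concepts 0, 1, 0 and a kernel `{1: 1.0, (0, 1): 0.5}`, the middle vertex sees both neighbors at `r=1` and accumulates one monogram hit and two bigram hits, so it scores higher than either end.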

Prospector Fitting
In our implementation, layers (I) and (II) are fitted separately and sequentially using the procedures below. Further details for each layer are found in Appendix C.3.

QUANTIZER FITTING (LAYER I)
Token embeddings sampled from across the training set are partitioned into $K$ subspaces using an unsupervised algorithm (e.g., K-means clustering). Afterward, each subspace represents a semantic concept $c \in C$ discovered in the corpus. To reduce computation, clustering can be performed over a representative sample ($> 10^3$) of the token embedding space, randomly sampled without replacement. Fitting is depicted in Figure 4.

KERNEL FITTING (LAYER II)
The kernel is fit in two steps, as outlined below.
Step 1: Computing frequencies & co-occurrences. For each sprite $S$ in the training set, prospectors first build a datum-level representation in order to learn dataset-wide patterns. This is performed by the rollup operator, which traverses $S$'s vertices, tracks concept monograms and skip-bigrams $z_i \in Z$, and counts their frequencies over all $r$-neighborhoods. This operation constructs a sprite embedding $z \in \mathbb{R}^{|Z|}$, which resembles a "bag-of-words" vector with longer-range "skip" interactions. Sprite embeddings are rescaled to account for differences in baseline frequencies (e.g., using TF-IDF (Sparck Jones, 1972)) and thus can be viewed as probabilities: $P(c_i)$ for monograms and $P(c_j, c_k)$ for skip-bigrams. The rollup operator and this step as a whole are described in Algorithm 1 and Figure S2.
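A rough sketch of the rollup step is given below; it is not a transcription of Algorithm 1. It assumes integer concept ids, unordered skip-bigrams keyed by sorted concept pairs, and counting each pair of distinct vertices once per neighborhood. The rescaling shown is a simple relative-frequency normalization, used here as a stand-in for TF-IDF. The function names are hypothetical.

```python
from collections import Counter, deque
from itertools import combinations


def r_neighborhood(adj, v, r):
    """Vertices within r hops of v (including v), via breadth-first search."""
    seen = {v}
    frontier = deque([(v, 0)])
    while frontier:
        u, depth = frontier.popleft()
        if depth == r:
            continue
        for w in adj[u]:
            if w not in seen:
                seen.add(w)
                frontier.append((w, depth + 1))
    return seen


def rollup(adj, sprite, r):
    """Count monograms and skip-bigrams over every r-neighborhood of a sprite."""
    counts = Counter()
    for v in adj:
        hood = sorted(r_neighborhood(adj, v, r))
        counts.update(sprite[u] for u in hood)  # monograms
        counts.update(
            tuple(sorted((sprite[u], sprite[w])))  # skip-bigrams
            for u, w in combinations(hood, 2)
        )
    return counts


def normalize(counts):
    """Rescale counts to relative frequencies, separately for monograms and bigrams."""
    mono_total = sum(n for z, n in counts.items() if not isinstance(z, tuple))
    bi_total = sum(n for z, n in counts.items() if isinstance(z, tuple))
    return {
        z: n / (bi_total if isinstance(z, tuple) else mono_total)
        for z, n in counts.items()
    }
```

The resulting dictionary plays the role of the sprite embedding: a sparse vector indexed by the elements of Z that actually occur in the datum.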
Step 2: Learning kernel weights. Prospectors next use the datum-level sprite embeddings $z$ to learn a vector $w \in \mathbb{R}^{|Z|}$ of class-specific weights for each monogram and skip-bigram across the entire training set. After fitting $w$, we construct $\omega$ as a dictionary mapping each element in $Z$ to its corresponding weight in $w$. We implement two approaches to learning weights, which make up the two main prospector variants: a linear classifier $h_w$ and a parameter-free fold-change computation. These variants are discussed further in Appendix C.3.3 and depicted graphically in Figure 5.
Linear classifier: This variant trains a linear classifier $h_w(z) = w^\top z$ to learn a mapping $z \to y$ over the training dataset. The learned coefficients $w$ then represent the class-specific importance of each index in $z$. We implement this as a logistic regression with elastic net regularization, with mixing hyperparameter $\lambda$.
Fold-change computation: Inspired by bioinformatics (Anders & Huber, 2010), this variant involves first computing mean sprite embeddings for each class over the training data. For example, for the negative class, $z_0 = \frac{1}{|D_0|} \sum_{S^{(i)} \in D_0} z^{(i)}$, where $D_0$ is the subset of the training dataset $(S^{(i)}, y^{(i)})$ for which $y^{(i)} = 0$. This mirrors the "baseline vector" commonly used by popular feature attribution methods (Sundararajan et al., 2017; Bilodeau et al., 2022; Afchar et al., 2021). Then, we compute $w$ as a vector of fold changes, $w = \log_2(z_1) - \log_2(z_0)$, and select significant weights using a hypothesis test for independent means. The latter step serves as a form of regularization.
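The fold-change variant reduces to a per-class mean followed by a difference of base-2 logarithms. The sketch below illustrates that computation only. It assumes sprite embeddings stored as dictionaries keyed by monogram or skip-bigram, treats missing keys as zero, and adds a small pseudocount to avoid taking the logarithm of zero. The pseudocount and the function names are assumptions for illustration, and the significance-testing step is omitted.

```python
import math


def mean_embedding(embeddings, keys):
    """Element-wise mean of sparse embeddings (missing keys count as 0)."""
    return {
        z: sum(e.get(z, 0.0) for e in embeddings) / len(embeddings)
        for z in keys
    }


def fold_change_kernel(embeddings, labels, pseudocount=1e-6):
    """Return {z: log2(mean class-1 value) - log2(mean class-0 value)}.

    Assumption: `pseudocount` is added to both means so that features
    absent from one class do not produce log2(0).
    """
    keys = set().union(*embeddings)
    pos = [e for e, y in zip(embeddings, labels) if y == 1]
    neg = [e for e, y in zip(embeddings, labels) if y == 0]
    z1 = mean_embedding(pos, keys)
    z0 = mean_embedding(neg, keys)
    return {
        z: math.log2(z1[z] + pseudocount) - math.log2(z0[z] + pseudocount)
        for z in keys
    }
```

A feature whose mean frequency is four times higher in class 1 than in class 0 receives a weight close to 2, and the reverse case receives a weight close to -2, which is what makes the resulting kernel directly readable as class enrichment.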

Meeting Challenges with Intentional Design
Prospectors overcome the limitations of current feature attribution methods by observing the following design principles. Firstly, for (a) data efficiency and few-shot capabilities, prospectors are parameter-efficient due to the sole use of concept monograms and skip-bigrams to build their kernel, requiring at most $2K + K^2$ parameters. Both variants for computing importance weights $w$ are thus data-efficient due to their parsimony. Secondly, prospectors are (b) computationally efficient: by operating as an equippable head, prospectors are "plug-in-ready" without encoder retraining (Kim et al., 2017) or backpropagation. The combination of efficient data structures and modeling primitives, such as dictionaries and convolutions, allows prospectors to efficiently scale feature attribution to high-dimensional data: namely, in linear time with respect to the tunable number of tokens $T$. We outline runtime complexity and speed benchmarking in Sections C.4 and D.1. Finally, prospectors achieve (c) improved localization and class-relevance by explicitly training on token embeddings to learn $G_y$ instead of using end-to-end classifiers to identify $G_y$ post hoc. We detail other favorable model properties in Appendix C.6.
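To give a sense of scale for the stated bound of $2K + K^2$ parameters, the following worked arithmetic evaluates it for two illustrative values of $K$. These values are chosen here for illustration and are not taken from the experiments.

```latex
\begin{aligned}
K = 10 &: \quad 2K + K^2 = 20 + 100 = 120 \\
K = 20 &: \quad 2K + K^2 = 40 + 400 = 440
\end{aligned}
```

Both cases fall in the range of hundreds of parameters, several orders of magnitude below the size of a typical encoder.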

Datasets, Encoders, & Baselines
We evaluate prospectors using three primary tasks, each representing a different data modality (sequences, images, and graphs). Each also poses unique challenges for prospector training and feature attribution: class imbalance (sequences), high input dimensionality with few examples (images), and very coarse supervision (graphs). As is common in scientific and biomedical data, all three datasets are amenable to the multiple instance assumption (MIA): that class 1 data largely resemble class 0 data with the exception of tokens only found in class 1 data (Amores, 2013; Foulds & Frank, 2010). Details for each dataset's construction are shared in Appendix D.5. For each task, we select representative encoders to which we equip prospector heads and relevant baseline attribution methods. We summarize encoders in
Sequences (1D): key sentence retrieval in text documents. Retrieval is an important task in language modeling that provides in-text answers to user queries. For this task, we use the WikiSection (Arnold et al., 2019) benchmark dataset created for paragraph-level classification. We repurpose WikiSection to assess the ability to retrieve target sentences specific to a queried class. We specifically use the "genetics" section label as a query, and class 1 data are defined as documents in the English-language "disease" subset that contain this section label. Our goal is to identify sentences that contain genetics-related information given only coarse supervision from document-level labels. After preprocessing the pre-split dataset, our dataset contained 2513 training examples (2177 in class 0, 336 in class 1) and 712 test examples.
The relationship between sentences in each document is represented as a graph with 2-hop connectivity (Figure S1).
Encoders & baselines: We assess two pretrained language models, MiniLM (Wang et al., 2020) and DeBERTa (He et al., 2020; 2021), used at partial-context. While DeBERTa is an off-the-shelf LLM for zero-shot classification (ZSC) and natural language inference (NLI), MiniLM is a sentence and paragraph embedding model, thus requiring prospectors to perform feature attribution at the sentence level.
For baseline attribution methods, we present a mix of (1) supervised heads and (2) off-the-shelf LLM inference. Firstly, supervised heads train on token-level class labels to identify class-specific sentences in testing. Specifically, we train a multi-layer perceptron (MLP) on labeled token embeddings and a one-class support vector machine (SVM) solely on class 0 token embeddings. In the latter case, we perform novelty detection to identify class 1 tokens. While not traditional explanation methods, the MLP and SVM heads are given a large advantage as semi- and fully supervised baselines (as opposed to prospectors, which are coarsely supervised at the datum level). For LLM inference, we used DeBERTa to output sentence-level ZSC probabilities (i.e., logits), NLI entailment scores, NLI entailment attention, and pooled Shapley values for ZSC. Implementation details are listed in Appendix D.9.
Encoders & baselines: We equip prospectors to four encoders: tile2vec (Jean et al., 2019), ViT (Dosovitskiy et al., 2020), CLIP (Radford et al., 2021), and PLIP (Huang et al., 2023). The first two encoders are trained with partial context, where tile2vec is unsupervised while ViT is weakly supervised with image-level label inheritance (Machiraju et al., 2022). Details on encoder training are provided in Appendix D. CLIP serves as a general-domain vision-language foundation model (VLM) and PLIP serves as a domain-specific version of CLIP for pathology images. Both VLM encoders are pretrained and used for partial-context inference on prespecified image patches. We choose two popular and computationally feasible explanation-based attribution baselines (Section 2): concatenated mean attention (Chen et al., 2022c) for ViT and concatenated prediction probability (Campanella et al., 2019; Machiraju et al., 2022; Halicek et al., 2019) for ViT, CLIP, and PLIP.
Graphs (3D): binding site identification in protein structures. Many proteins rely on binding to metal ions in order to perform their biological functions, such as reaction catalysis in enzymes, and identifying the binding-specific amino acids is important for engineering and design applications. We generated a dataset of metal binding sites in enzymes using MetalPDB (Putignano et al., 2018), a curated dataset derived from the Protein Data Bank (PDB) (Berman et al., 2002). Focusing on zinc, the most common metal in the PDB, we generate a gold standard dataset of 610 zinc-binding (class 1) enzymes and 653 non-binding (class 0) enzymes (see Appendix D.5.3). Each protein structure is defined using the positions of its atoms in 3D space and subdivided into tokens representing amino acids (a.k.a. "residues"). The relationship between residues is represented as a graph with edges defined by inter-atomic distance (Figure S1). This task is particularly challenging due to potentially overlapping class-specific features (i.e., proteins of both classes are metal-binders), highly heterogeneous background data (proteins in train and test sets adopt a wide variety of structural folds), and relatively small target regions, making this an example of a "needle-in-the-haystack" task (Pawlowski et al., 2019).
Encoders & baselines: We apply prospector heads to three encoders: COLLAPSE, an FM which produces embeddings of the local 3D structure surrounding each residue (Derry & Altman, 2023); ESM2, a protein LLM which produces embeddings for each residue based on 1D sequence (Lin et al., 2023); and a simple amino acid encoder (AA), where each residue is one-hot encoded by amino acid identity. By construction, ESM2 is a full-context encoder while COLLAPSE and AA are partial-context encoders. We present three baselines built on top of a supervised GAT (Veličković et al., 2017) classifier head (trained on protein-level labels) to identify binding residues: Attention, Shapley values (SHAP), and GNNExplainer (Ying et al., 2019). Implementation details are listed in Appendix D.9.

Results
Prospectors outperform baseline attribution methods in region localization and generalize across data modalities. In all tasks, prospectors achieve higher AUPRC and average precision (AP) than baseline methods, often with large improvements (Figure 6). For text retrieval, prospectors improve mean test-set AUPRC to 0.711, up from 0.626 for the top supervised baseline (MLP head; an 8.5-point gain) and from 0.584 for the top LLM inference baseline (NLI entailment). In summary, MiniLM with an equipped prospector head outperforms DeBERTa's baselines by 12.7 points in AUPRC despite being 5× smaller and having relatively limited pretraining (Table S5). We also observe improved localization over baselines for Camelyon16 (26.3 points in AUPRC and 8.8 points in AP) and MetalPDB (22.0 points in AUPRC and 8.8 points in AP). For the MetalPDB dataset, the optimal methods tend to exhibit bimodal performance, with almost perfect predictions for a subset of the test dataset (particularly cysteine-dependent binding motifs, see Figure 9) and poor performance on other subsets, resulting in the clustering of points around 0.5 and 1.0 AUPRC. This behavior suggests that AP more clearly reflects task performance, highlighting the ability of prospectors to identify small conserved binding patterns.
The choice of encoder is key to optimal prospector performance. While prospectors overall improve localization performance over baselines regardless of the chosen encoder, the performance gain is maximized by choosing domain-specialized encoders for each dataset. For Camelyon16 and MetalPDB, the combination of prospectors with FM encoders (CLIP, PLIP, COLLAPSE) showed the strongest localization results, as shown in Figure 6. Among FMs, the best-performing encoders are those with the most task-specificity: PLIP has a domain advantage by virtue of being a CLIP-style encoder trained on pathology images, while COLLAPSE accounts for complex 3D atomic geometry rather than simply amino acid sequence (as in ESM2) or one-hot encoding (AA). Interestingly, we note that the AA encoder presents an exception to encoder generalization, supporting that prospectors themselves can identify salient patterns with rudimentary encoder semantics. This is likely due to the fact that many zinc-binding motifs rely on atomic coordination by three to four cysteine residues, which are otherwise rarely found in such arrangements. For tasks which require the detection of less amino acid-dependent structural patterns, we expect the COLLAPSE encoder to result in optimal prospector performance.
Prospectors are robust to coarse supervision. Next, we explore the relationship between the properties of class-specific regions and localization performance. To characterize class-specific regions, we compute two metrics acting as proxies for coarse supervision (Section 3.2): region prevalence (# class 1 tokens / # tokens) and mean region dispersion (# connected components / mean component size).
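The two metrics follow directly from their definitions, as the sketch below illustrates. It assumes token-level ground truth stored as a dictionary `{vertex: 0 or 1}` and a map graph stored as an adjacency dictionary. Connected components are computed only among class 1 vertices. The function names and the convention of returning 0 for a datum with no class 1 tokens are assumptions for illustration.

```python
def region_prevalence(labels):
    """Fraction of tokens that belong to the class 1 region."""
    return sum(labels.values()) / len(labels)


def connected_components(adj, vertices):
    """Connected components of the subgraph induced by `vertices`."""
    vertices = set(vertices)
    seen, components = set(), []
    for v in vertices:
        if v in seen:
            continue
        stack, component = [v], set()
        seen.add(v)
        while stack:  # depth-first traversal restricted to `vertices`
            u = stack.pop()
            component.add(u)
            for w in adj[u]:
                if w in vertices and w not in seen:
                    seen.add(w)
                    stack.append(w)
        components.append(component)
    return components


def mean_region_dispersion(adj, labels):
    """Number of class 1 components divided by their mean size."""
    region = [v for v, y in labels.items() if y == 1]
    components = connected_components(adj, region)
    if not components:
        return 0.0  # assumption: no class 1 tokens means zero dispersion
    mean_size = sum(len(c) for c in components) / len(components)
    return len(components) / mean_size
```

On a six-token path with class 1 tokens at positions 0, 1, and 4, prevalence is 0.5 and there are two components of mean size 1.5, giving a dispersion of 2 / 1.5.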
For Camelyon16, we plot the relationship between test-set AUPRC and both metrics in Figure 7. Full results over all datasets are presented in Appendix D.11.For each plot, we also display the top baseline method.
Firstly, we observe that most encoders exhibit a positive correlation between region prevalence and localization AUPRC across all modalities. However, some encoders are particularly robust to region prevalence and achieve high AUPRC despite low prevalence (MiniLM, PLIP, COLLAPSE), and prospectors are consistently more robust than top baselines over all data modalities. Secondly, mean region dispersion and localization performance (both AUPRC and AP) demonstrate a parabolic relationship, indicating that some level of dispersion is needed for detectable regions, while too much dispersion makes the task challenging. These results recapitulate each task's challenges: the pathology task contains a wide range of dispersion values, while the protein task contains the lowest levels of prevalence and highest levels of dispersion (Appendix C.6). Despite these task differences, prospector-equipped FMs demonstrate high levels of robustness to coarse supervision across modalities.
Prospectors' sprite embeddings and kernels are interpretable and enable internal visualization. In addition to improved localization performance, prospectors are inherently interpretable because their parameters provide insights into invariant class-specific patterns. Prospect maps visualize the feature attribution outputs in the input token space; importantly, these maps can be further contextualized by visualizing prospector internals themselves. Due to the use of learned semantic concepts, the global convolutional kernel can be represented as a semantic network or as a heatmap (Appendix C.2), along with each input example as it passes through layers of the prospector head. We illustrate this interpretability for pathology images (Figure 8) and protein structure (Figure 9) using two test-set examples. We first visualize data sprites, which reflect the learned concepts mapped onto data inputs (from layer (I)). By analyzing semantic concepts on the data sprite, it is possible to assign domain-specific meaning to each concept. Additionally, by visualizing concept and co-occurrence frequencies in the sprite embedding, we can identify over- or under-represented patterns within each input. By visualizing the global kernel, which captures dataset-wide concept associations and their correlations with class labels, it is possible to cross-reference between the sprite and the class-specific regions of the resulting prospect map. The ability to visualize the internals of a prospector head in terms of concepts facilitates human-in-the-loop model development and the incorporation of domain knowledge, a major advantage relative to "black box" models.

Prospector kernels allow for parsimony to find "hub" concepts. Our pathology visualization (Figure 8) demonstrates a kernel with "hub," or densely connected and highly predictive, concepts: concept #4 is indicative of class 1 while concept #9 is indicative of class 0. Such kernels demonstrate how prospectors do not detract from the rich semantics offered by FM encoders like PLIP for pathology data.
Prospection is robust to concept distributional shifts. Visualizing kernels for protein structures outlines prevalent class 1-specific concepts in training data (e.g., concepts #7, #17) that are rare in the test set but nonetheless are critical for classification. Despite their low frequency, top prospectors achieved performant localization for this task. The distributional shift between train and test set is a likely explanation for the bimodal localization performance on this task, and suggests that improvements to kernel design and fitting (e.g., feature scaling and choice of $K$) along with constructing optimally representative training datasets (e.g., for a more varied class 0) would improve prospector performance on more difficult data subsets.
Sprite embeddings also carry class signal. Further analysis of learned parameters can also help to better understand the nature of discovered patterns. For example, there may be more than one pattern which results in a particular class label, and differentiating examples that exhibit each pattern can uncover mechanistic subgroups of the data. To demonstrate this, we hierarchically cluster the sample-level sprite embeddings in the MetalPDB test set. This identified two major subgroups of zinc binding sites (Figure S7) defined by the number of cysteine residues coordinating the bound ion. One subgroup is enriched for proteins which contain four coordinating cysteines, while in the other there are one or more histidine residues involved in the binding interaction. Figure 9 shows an example from each cluster, including a visualization of the zinc-binding site on the far right. This finding recapitulates known subtypes of zinc binding motifs (Wu et al., 2010), and more broadly demonstrates the potential for prospectors to discover biological mechanisms when applied to less well-studied phenomena.

Discussion & Conclusion
This work presents prospector heads, encoder-equippable modules for (a) data-efficient, (b) time-efficient, and (c) performant feature attribution. We show that prospectors are both modality-generalizable and encoder-agnostic, with particularly dominant performance when equipped to domain-specialized FMs. Finally, we show that prospectors are interpretable through their use of concept-based kernels.
Prospectors' improved localization performance over explanation-based baselines calls into question the underlying assumption of explanations themselves: that end-to-end classifiers implicitly "segment" data in the input token space en route to making class predictions, and that these "segmentations" can be extracted post hoc. Our results suggest that using machine-derived concepts and modeling class-specific associations directly in the input token space helps to avoid such modeling assumptions.

[Figure caption, partial: We show the same five visualizations as before, as well as a visualization of the atomic configuration of the zinc binding site to illustrate the binding motifs discovered by sprite embedding clustering (Figure S7). Whole proteins are visualized as cartoons instead of graphs for clarity.]
We believe a key driver of prospectors' performance is the combination of token-level representations with the local inductive bias provided by convolution. This combination fosters a form of inductive reasoning through "token mixing" and kernel construction. Several other aspects of prospector design draw inspiration from ideas across ML research (Appendix A.3), giving insights into their performance characteristics. Our results suggest that FMs in particular contain strong distributional semantics which yield precise feature attribution even with partial-context encoders and coarse levels of supervision. In other words, FMs (in tandem with quantization) remove the burden of long-context reasoning by reducing input data to mosaics of concepts (i.e., sprites). Prospectors can thus functionally operate over long-range dependencies even with a local inductive bias. This claim of capturing short- and long-range dependencies between tokens is backed by prospectors' localization robustness to region prevalence and dispersion. Additionally, because domain-specific FMs do improve performance when they are available (e.g., PLIP vs. CLIP), we hypothesize that as FMs continue to improve and be adapted to new applications and data modalities, so will the utility of prospectors across diverse domains.
Prospectors are flexible and modular by design, enabling not only variable encoders but also simple changes in their fitting. Of the two variants we fitted, the non-trainable fold-change variant was superior for almost all evaluated settings (Appendix D.3). This may be because the variant explicitly learns dataset-wide concept associations and deviations from a class 0 "baseline vector" (Sundararajan et al., 2017; Bilodeau et al., 2022; Afchar et al., 2021), which closely reflects the MIA (Section 4.1). It is possible that different kernel fitting methods are better suited to detecting different types of class-specific patterns, but further investigation is needed to explore this question.
One limitation of this work is the lack of sensitivity analysis for all design choices and hyperparameters. For example, due to time and compute constraints, we relied on domain knowledge to select token resolution and connectivity for each task instead of testing their impacts on performance. Furthermore, we studied neither the choice of clustering method nor the embedding sample size in the quantization step, and we limited our experimentation to open-source encoders. Future work includes a PyTorch implementation for GPU acceleration, enabling kernels to learn higher-order n-grams, adding new variants for kernel fitting, deployments on varied data modalities, and exploring prospector utility with frontier non-Transformer architectures (e.g., state-space models (Gu et al., 2021; Poli et al., 2023a) and their attention hybrids (Poli et al., 2023b)) and API-locked LLMs (Bommasani et al., 2023).
We anticipate many potential use cases for prospectors, particularly in tandem with vector databases and in other compound AI systems and agentic workflows (Zaharia et al.).
One particular use case is to screen or classify data with FMs equipped with performant classifier heads (Swanson et al., 2022), and then swap in prospector heads when feature attribution is required. This process can enable users to investigate multiple class labels (e.g., scientific phenomena) without encoder retraining. Another use case is to use prospector-generated attributions to train downstream rationale models (Jain & Wallace, 2019; Chen et al., 2022a; Yang et al., 2023; Bujel et al., 2023). In general, we believe that prospectors expand the toolkit for improving the transparency and utility of large FMs, high-dimensional data, and large-scale datasets, ultimately inspiring new few-shot inference modes for FMs. For scientific and biomedical applications, including in data-scarce settings, prospectors have the potential to provide mechanistic insights and discover phenomena in complex data (Wang et al., 2023).

Impact Statement
Trust and safety considerations are increasingly important as AI becomes a more prominent part of high-impact disciplines such as science and biomedicine. This concern is particularly relevant for large "black box" foundation models. The goal of this work is to provide a new approach to feature attribution for large models and complex datasets to improve the transparency of AI systems. It is important to note that our method is specifically not designed to be an explanation of a model's reasoning, and any feature attributions made by prospector heads should be carefully interpreted by the user in the context of the data modality.

A Related Work (Extended)
A.1 Feature Attribution via Explanation
In the current explanation-based paradigm, feature attributions are referred to as "explanations" and are performed by (1) training a supervised model before (2) interrogating the model's behavior (e.g., via internals, forward or backward passes, or input perturbations) and inferring class-specific features. This framework can be described as weak or coarse supervision (Robinson et al., 2020) due to the sole use of class labels as a supervisory signal in combination with a low signal-to-noise ratio in the datum-label pairs, particularly when the prevalence of class-specific features is low (Pawlowski et al., 2019).
Explanations, and feature attribution more broadly, can be categorized as either model-specific methods, which aim to describe how model weights interact with different input features, or model-agnostic methods, which aim to describe how each feature contributes to prediction. Explanation-based attribution methods in general are inherently data-inefficient as they require ample labeled training data to train underlying classifiers. It should also be noted that few methods can be applied to all data modalities.
On the other hand, model-agnostic methods like SHAP (Lundberg & Lee, 2017) and LIME (Ribeiro et al., 2016) perturb input features to determine their differential contribution to classification. Recent work has shown SHAP struggles to localize class-specific regions and is provably no better than random guessing for inferring model behavior or for downstream tasks (Bilodeau et al., 2022). Furthermore, SHAP-style methods can be computationally expensive for a variety of reasons. Some methods face exponential or quadratic time complexities (Ancona et al., 2019) with respect to the number of input features (e.g., pixels in an image) and are thus infeasible for high-dimensional data, while others require multiple forward and/or backward passes (Chen et al., 2022b) or require training additional comparably sized deep networks along with the original classifier (Jethani et al., 2021).

A.2 Modern Encoders & Context Sizes
Most modern encoders for unstructured data operate on tokens, or relatively small pieces of a datum, and their representations.
Tokens can be user-prespecified and/or constructed by the encoder itself (potentially with help from a tokenizer); the corresponding encoders are referred to as partial-context and full-context, respectively (Figure 2). Due to computational constraints, high-dimensional unstructured data (e.g., gigapixel images) often require user-prespecified tokens (i.e., patches) and partial-context encoders that embed each token (Lu et al., 2023; Huang et al., 2023; Klemmer et al., 2023; Lanusse et al., 2023).
We provide an illustrative example for the image modality. In this setting, determining encoder context is based on practical modeling constraints: computational complexity of an architecture's modeling primitives, input data dimensionality, and hardware. For example, an attention-based Vision Transformer (Dosovitskiy et al., 2020) experiences quadratic time complexity (Keles et al., 2022) with respect to input dimension. Standard images (e.g., 224 × 224 pixels) easily fit in modern GPU memory, enabling us to train full-context encoders that construct token embeddings via intermediary layers. However, gigapixel images require user-prespecified tokens (i.e., patches) and partial-context encoders.
Regarding feature attribution for partial-context models, gradient-based saliency and attention maps have been used to explain class predictions for high-dimensional unstructured data like gigapixel imagery (Campanella et al., 2019; Chen et al., 2022c). However, studies report low specificity and sensitivity (Machiraju et al., 2022), in part because attribution for the entire datum is built by concatenating attributions across prespecified tokens (e.g., image patches). Partial-context strategies incorrectly assume prespecified tokens are independent and identically distributed (IID).
Our work hinges on the assumption that FMs learn particularly rich embeddings and distributional semantics (and thus sets of concepts) by virtue of their representational power. While feature attribution has not been explored by adapting FM embeddings, this work is inspired by recent efforts to perform object detection and visual grounding via FM adaptation (Kuo et al., 2022; Kalibhat et al., 2023).

A.3 Broader Connections across ML
Prospectors bring together ideas from many classical and modern works in adaptation, interpretability, memory augmentation (Khosla et al., 2023), information retrieval, and language modeling. On the surface, prospectors resemble probing models (Alain & Bengio, 2016; Belinkov, 2021), but the fact that they learn token associations between multiple token embeddings is more akin to constellation models (Weber et al., 2000), self-attention layers (Vaswani et al., 2017), or multiple instance learning approaches (Javed et al., 2022). Layer (I) is inspired by concept bottlenecks (Section 2), but extends the definition of concepts to carry spatial semantics. To learn higher-order associations between concepts, i.e., "token mixing" and inductive reasoning, layer (II) is inspired by both sliding window attention (Parmar et al., 2018; Child et al., 2019; Beltagy et al., 2020) and the emergent n-gram circuits seen in transformer induction heads (Akyürek et al., 2024; Olsson et al., 2022; Bietti et al., 2023). We foster the pattern-recognition capability via associative memory units (Hopfield, 1982; 1984; Kohonen, 1972; Ramsauer et al., 2020) built with an encoder's learned representations and graphical models (Liu & Mukhopadhyay, 2018; Graves et al., 2013). The result is that while prospectors are inspired by LLM reasoning, their implementation uses efficient statistical techniques, modeling primitives, and data structures.

B Out-of-scope Attribution Methods
For transparency, we also outline our choice to rule out certain baselines for our experiments. A top priority for baseline selection was modality generalizability.
LIME: we rule out LIME (Ribeiro et al., 2016) as a baseline for any of our tested data modalities. This is primarily because LIME requires ground truth labels to explain each input. Since the inputs to our partial-context encoders (and, in turn, LIME) are prespecified tokens (e.g., sentences for the WikiSection task), LIME requires token-level labels to explain the importance of sub-tokens (e.g., words). This requirement of token-level labels in our setting is fundamentally at odds with prospectors' goal to predict token labels, i.e., learn class-specific tokens de novo.
FastSHAP: we do not compare prospectors to modern methods like FastSHAP (Jethani et al., 2021), which requires training additional models. FastSHAP specifically requires training two models comparable to the original encoder (i.e., with a classifier head) with respect to parameter count: a "surrogate" model that typically mimics the encoder in architecture but is trained with a masked-input training regime, and an "explainer" model that learns to identify class-specific tokens. Such approaches are out of scope for this work, which aims to perform feature attribution with large models like FMs. Training surrogates for FMs is often practically infeasible.

C.1 Core Definitions
All unstructured data can be represented as map graphs of tokens interacting in physical space. We introduce mathematical definitions to describe these representations. Map graphs are also depicted in Figure S1.
Definition C.1 (Map Graph). A map graph $G(V, E)$ is a collection of vertices $V$ and edges $E$ connecting neighboring vertices in Euclidean space. Each vertex $v^{(i)} \in V$ has features $x^{(i)}$ and each edge $e^{(i \leftrightarrow j)} \in E$ connects vertices $v^{(i)}$ and $v^{(j)}$.
Definition C.2 (Connectivity). A map graph $G$'s connectivity $\delta$ is its maximal node degree.
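As an illustration of Definitions C.1 and C.2, the following minimal sketch builds a map graph for a grid of image patch tokens and computes its connectivity. The function names `grid_map_graph` and `connectivity`, and the choice of 4-way or 8-way neighborhoods, are assumptions made for this example rather than part of the released implementation.

```python
from itertools import product

def grid_map_graph(height, width, eight_way=True):
    """Build a map graph for a height x width grid of patch tokens.

    Vertices are (row, col) tuples; edges are frozensets holding two
    neighboring vertices, so each undirected edge is stored once.
    """
    vertices = list(product(range(height), range(width)))
    # 4-way neighbors (up, down, left, right); 8-way adds the diagonals.
    offsets = [(-1, 0), (1, 0), (0, -1), (0, 1)]
    if eight_way:
        offsets += [(-1, -1), (-1, 1), (1, -1), (1, 1)]
    vertex_set = set(vertices)
    edges = set()
    for (r, c) in vertices:
        for dr, dc in offsets:
            neighbor = (r + dr, c + dc)
            if neighbor in vertex_set:
                edges.add(frozenset({(r, c), neighbor}))
    return vertices, edges

def connectivity(vertices, edges):
    """Connectivity (delta): the maximal node degree of the map graph."""
    degree = {v: 0 for v in vertices}
    for edge in edges:
        for v in edge:
            degree[v] += 1
    return max(degree.values()) if degree else 0

vertices, edges = grid_map_graph(3, 3, eight_way=True)
delta = connectivity(vertices, edges)  # the center patch touches all 8 others
```

For a 3 × 3 grid, the 8-way graph has connectivity 8 (attained by the center patch), while the 4-way graph has connectivity 4.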

C.2 Visualizing Prospectors
We visualize the dictionaries created by prospectors (e.g., the kernel ω and the dictionary built during rollup, Appendix C.3.1) in two main styles throughout this work. The first style is the semantic network, which allows us to easily visualize either frequencies (in sprites) or importance weights (in kernels) for monogram and skip-bigram associations. These plots are sometimes referred to as "chord diagrams" or "circos plots". This data structure is defined mathematically as a self-complete graph:
Definition C.5 (Self-complete graph). A self-complete graph $K_K(V, E)$ is a fully connected graph with $K$ vertices, where every pair of distinct vertices $v^{(i)}, v^{(j)}$ is connected by a unique edge $e^{(i \leftrightarrow j)}$. It also contains all self-edges that connect any vertex $v^{(i)}$ to itself with edge $e^{(i \leftrightarrow i)}$. Thus, self-complete graphs contain $K$ vertices and $K + \binom{K}{2}$ edges.
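The edge count in Definition C.5 can be checked by direct enumeration. The sketch below is illustrative only; the helper name `self_complete_edges` and the integer labeling of concepts are assumptions made for this example.

```python
from itertools import combinations
from math import comb

def self_complete_edges(K):
    """Enumerate the edges of a self-complete graph on concepts 0..K-1."""
    self_edges = [(i, i) for i in range(K)]       # one self-edge per vertex
    pair_edges = list(combinations(range(K), 2))  # one edge per distinct pair
    return self_edges + pair_edges

K = 5
edges = self_complete_edges(K)
n_edges = len(edges)  # K self-edges plus C(K, 2) distinct-pair edges
```

With K = 5 this enumerates 5 self-edges and 10 distinct-pair edges, i.e., 15 edges in total.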

C.3.1 LAYER II
The rollup operator, named after the function of the same name in relational databases, resembles a sliding bag-of-words featurization scheme. Internally, a dictionary ζ is constructed to capture all monograms and skip-bigrams in each neighborhood of S. This operator is described by Algorithm 1 and depicted in Figure S2. For a full view of fitting layer (II), including both steps 1 (rollup) and 2, refer to Figure 5. We note that all sprite embeddings created in rollup were normalized using TF-IDF scaling (Sparck Jones, 1972) prior to kernel fitting.
Linear classifier: This variant was implemented with the sklearn Python package. The elastic net classifiers (λ = 0.5) were trained for a maximum of 3000 iterations using the saga solver.
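The rollup step described in this subsection can be sketched as a counting pass over r-hop neighborhoods. This is a simplified illustration, not a reproduction of Algorithm 1: the adjacency-dictionary input, the function names, and the convention that each skip-bigram is counted once from each endpoint are all assumptions, and TF-IDF scaling is omitted.

```python
from collections import Counter, deque

def r_neighborhood(adj, center, r):
    """Return the vertices within r hops of `center` (excluding it), via BFS."""
    hops = {center: 0}
    queue = deque([center])
    while queue:
        v = queue.popleft()
        if hops[v] == r:
            continue  # do not expand beyond r hops
        for nb in adj[v]:
            if nb not in hops:
                hops[nb] = hops[v] + 1
                queue.append(nb)
    del hops[center]
    return list(hops)

def rollup(adj, concept, r):
    """Count monograms and skip-bigrams over every r-neighborhood of a sprite.

    adj:     dict mapping each vertex to a list of neighboring vertices
    concept: dict mapping each vertex to its quantized concept index
    """
    zeta = Counter()
    for v in adj:
        zeta[(concept[v],)] += 1  # monogram of the central token
        for u in r_neighborhood(adj, v, r):
            # Skip-bigrams are treated as unordered concept pairs here.
            zeta[tuple(sorted((concept[v], concept[u])))] += 1
    return zeta

# A 4-token sequence (path graph) with alternating concepts 0, 1, 0, 1.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
concept = {0: 0, 1: 1, 2: 0, 3: 1}
zeta_r1 = rollup(adj, concept, r=1)
zeta_r2 = rollup(adj, concept, r=2)
```

Increasing r from 1 to 2 leaves the monogram counts unchanged but introduces the same-concept skip-bigrams (0, 0) and (1, 1), which only become visible once tokens two hops apart fall in the same neighborhood.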

Algorithm 1 rollup
Fold-change computation: As an alternative to regularization for fold-change variants, we use two-way thresholding inspired by differential expression analysis (Anders & Huber, 2010). These thresholds offer a form of "masking" for importance weights $w_i \in w$. As described in the main body of this work, the first threshold is τ, the minimum fold-change required. The other threshold is α, which is used for a statistical hypothesis test on independent class means. This test is conducted for each weight entry $w_i$ in $w$, and significance is assessed via a Mann-Whitney U hypothesis test. Prior to weight masking, given the number of independent tests being conducted, we adjust our chosen significance threshold using the commonly used Bonferroni correction: our original α threshold is divided by the number of entries in $w$ ($|Z|$), i.e., $\alpha^* := \alpha / |Z|$. Finally, to perform masking, we use ±τ to mask out sufficiently small absolute fold changes (e.g., ±1, which indicates a requirement for doubling in $\log_2$ scale), and use $\alpha^*$ to mask out non-significant differences assessed by our hypothesis test.
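The two-way thresholding above can be sketched as follows. This is a minimal illustration under stated assumptions: the p-values are taken as given (e.g., computed beforehand with a Mann-Whitney U test per entry), the function name `mask_weights` is hypothetical, and the use of a non-strict comparison for τ and a strict comparison for the corrected α is a choice made for this example.

```python
def mask_weights(log2_fold_changes, p_values, tau=1.0, alpha=0.05):
    """Zero out weights that fail the fold-change or significance threshold.

    log2_fold_changes: one log2 fold change per dictionary entry
    p_values:          one p-value per entry (assumed precomputed)
    """
    n_tests = len(log2_fold_changes)
    alpha_star = alpha / n_tests  # Bonferroni-corrected significance level
    masked = []
    for fold_change, p in zip(log2_fold_changes, p_values):
        large_enough = abs(fold_change) >= tau
        significant = p < alpha_star
        masked.append(fold_change if (large_enough and significant) else 0.0)
    return masked

# Four entries, so alpha* = 0.05 / 4 = 0.0125.
weights = mask_weights(
    log2_fold_changes=[2.0, 0.5, -1.5, 1.2],
    p_values=[0.001, 0.001, 0.004, 0.2],
)
```

In this toy example the second entry is masked for having too small a fold change and the fourth for failing the corrected significance level, leaving one class 1-associated and one class 0-associated weight.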

C.4 Inferential Time Complexity
Here we conduct a comparative runtime analysis, where we analyze the worst-case time complexity required to explain a single input datum. We focus our analysis on the image modality due to compatibility with many baseline attribution methods. Suppose we have a trained encoder (e.g., an end-to-end classifier, unsupervised learner, etc.) and our datum has T = |V| (tunable) tokens to analyze. Importantly, full-context encoders process all T tokens at once while partial-context encoders process T tokens in sequence. This distinction affects runtime complexity, so we analyze complexity for both partial- and full-context settings.

C.4.1 PROSPECTORS
To analyze prospectors, we consider two main variables in computation: the number of tokens T and the number of operations for a forward pass (F) of the underlying encoder. Given these variables, prospectors themselves require only O(T) computations per layer at inference time: O(T) to quantize each token and O(T) to traverse over all tokens during convolution. The latter operation ignores a near-constant term for the worst-case number of interactions, i.e., skip-bigrams between the central token and tokens in the r-neighborhood. The worst-case number of interactions is modality- and user-specific and is dependent on G's topology, r, and connectivity δ (i.e., max node degree). Parameters r and δ are both
C.7 Theoretical Insights

C.7.1 IMPLICIT n-GRAMS
Prospector heads are inspired by the induction heads (Olsson et al., 2022), also referred to as n-gram heads (Akyürek et al., 2024), found in trained transformers for language modeling, which also inspired our method's name. While induction heads perform a sort of "pattern completion" (Olsson et al., 2022) using tokens, our approach achieves a form of "pattern recognition" and simplifies this computation and parameter space in two ways: the quantization of tokens into a set of K concepts and the explicit learning of monograms and skip-bigrams (n = 1 and n = 2).
We claim that this strategy of learning monograms and skip-bigrams is sufficient for implicitly learning higher-order n-gram targets. Namely, we argue that skip-bigrams can be "chained" together to form implicit skip-n-grams during attribution, i.e., during the creation of prospect maps in layer (II). For example, iteration i of convolution may find a skip-bigram of concepts A-B within the r-sized receptive field (i.e., A and B may be up to r hops away) and then iteration i + r may find a skip-bigram of concepts B-C. Together, one can argue that both skip-bigrams form an implicit skip-trigram A-B-C. This implicit chaining of skip-n-grams can also lead to implicitly capturing longer-range dependencies. In Theorem C.6 below, we show that skip-n-grams can be implicitly chained up to (n − 1)r hops away in a map graph of tokens, G.
Theorem C.6 (Range of implicit n-grams). Given a map graph G of cardinality T, prospectors with receptive field r and an ideal kernel can find all target 1-grams, skip-2-grams, . . ., skip-n-grams spanning up to (n − 1)r node hops.
Proof sketch. First, we explore the n := 1 case (i.e., monograms). Here, all target 1-grams are found trivially via kernel look-ups. Next, we consider the n := 2 case (i.e., skip-bigrams). Given the receptive field r, skip-bigrams can be found up to r hops away from the central node. Both the n := 1 and n := 2 cases can be generalized to single K2conv iterations over a large graph G (large T) or for small G where T ≤ 2r (G fully captured within r hops). Given that prospectors natively find monograms and skip-bigrams, multiple convolutional iterations are needed to find skip-n-grams with n ≥ 3. We explore these cases next.
For the n := 3 case, i.e., skip-trigrams, two skip-bigrams must be found in sequence with a shared token between them. We call this process "bigram chaining." Given that a skip-bigram can be learned over r hops, prospectors can thus learn a skip-trigram over 2r node hops. The desired property generalizes over any choice of n (and r) via induction. ■
Through the kernel's "memorization" of salient monograms and skip-bigram "rules," prospectors offer flexibility without exorbitant parameterization (as with attention): the kernel does not need to see and learn a particular skip-n-gram in training, but at inference time it can implicitly construct and recognize higher-order skip-n-grams from its learned bigrams.
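The bigram chaining argument can be illustrated on a toy token sequence (a path graph). The sketch below is a simplification under stated assumptions: concepts are single letters, the learned "rules" are ordered pairs, and the helper names `find_skip_bigrams` and `chain_bigrams` are hypothetical. It shows two skip-bigrams, each spanning r hops, chaining into an implicit skip-trigram that spans 2r hops.

```python
def find_skip_bigrams(tokens, rules, r):
    """Return index pairs (i, j), i < j <= i + r, whose concepts match a rule."""
    hits = []
    for i in range(len(tokens)):
        last = min(i + r, len(tokens) - 1)
        for j in range(i + 1, last + 1):
            if (tokens[i], tokens[j]) in rules:
                hits.append((i, j))
    return hits

def chain_bigrams(hits):
    """Chain bigram hits that share a token into implicit trigrams (i, j, k)."""
    return [(i, j, k) for (i, j) in hits for (j2, k) in hits if j == j2]

# Concepts A, B, C separated by filler concept "x"; receptive field r = 3.
tokens = ["A", "x", "x", "B", "x", "x", "C"]
rules = {("A", "B"), ("B", "C")}  # learned skip-bigram rules (assumed)
r = 3

hits = find_skip_bigrams(tokens, rules, r)
trigrams = chain_bigrams(hits)
span = max(k - i for (i, _, k) in trigrams)  # hops covered by the trigram
```

Here neither rule alone covers more than r = 3 hops, yet the chained trigram A-B-C covers 6 = (3 − 1) · r hops, matching the bound in Theorem C.6 for n = 3.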

C.7.2 PROSPECTOR FAILURE MODES
One potential failure mode for prospection is triggered by small receptive fields (r), which can prevent prospectors from learning target skip-bigrams or skip-n-grams for any n. In the previous section, we showed how prospectors can potentially "chain" skip-bigrams to implicitly learn higher-order skip-n-grams (as seen with transformer induction heads). However, this expressivity hinges on a sufficient choice of r: at a minimum, the r-sized field must capture the target bigrams. We hope to study this potential failure mode with synthetic benchmarks in future work.

C.7.3 IMPOSSIBILITY THEOREMS FOR FEATURE ATTRIBUTION
Finally, another main motivation in prospector design is recent work on impossibility theorems (Bilodeau et al., 2022), showing that (a) complete and (b) linear attribution methods can provably fail to improve on random guessing for inferring model behavior. Our approach sought to develop attribution methods outside of these traditional axioms (a) and (b) (Sundararajan et al., 2017). Prospector heads are not complete because they do not constrain all token attribution scores in a datum to sum to a class prediction. The linear model variant uses its coefficients to attribute tokens, while the fold-change variant does not even output a class prediction.

D Experimental details
D.1 Speed benchmarking for Inference
We run a speed benchmarking analysis between two main encoder-attribution pipelines: (1) MiniLM with a prospector head and (2) DeBERTa with a zero-shot classification head and PartitionSHAP. Given the Huggingface implementation for
We preprocess the data into classes by searching each document for the presence of "disease.genetics" section labels. If this section label is found, we assign a document-level label of class 1, and class 0 otherwise. Because our task is at the sentence level, we then create tokens by breaking sections into sentences at the full-stop delimiter ("."). We then label sentences by their source section labels. Raw-text sentences are then fed into our chosen encoder, which handles natural language tokenization.

D.5.2 IMAGES (CAMELYON16)
This benchmark contains 400 gigapixel whole slide images (270 train, 130 test) of breast cancer metastases in sentinel lymph nodes. All images were partitioned into prespecified patch tokens (size 224 × 224) and filtered for foreground tissue regions (as opposed to the glass background of the slide). This process resulted in more than 200K unique patches without augmentation. For ground truth annotations, binary masks were resized with inter-area interpolation and re-binarized (value of 1 is assigned if interpolated value > 0) to match the dimensionality of data sprites.
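The mask resizing step can be sketched in pure Python. This is an illustrative approximation only: it assumes an integer downscale factor, for which area-based interpolation amounts to averaging each block of pixels, and the function name `downsample_mask` is hypothetical. The actual pipeline may rely on an image library's area interpolation instead.

```python
def downsample_mask(mask, patch):
    """Block-average a binary mask by `patch`, then re-binarize with value > 0.

    mask:  list of rows of 0/1 pixel labels
    patch: side length of each square block (the patch token size)
    """
    n_rows, n_cols = len(mask) // patch, len(mask[0]) // patch
    coarse = []
    for block_row in range(n_rows):
        row = []
        for block_col in range(n_cols):
            block = [
                mask[block_row * patch + i][block_col * patch + j]
                for i in range(patch)
                for j in range(patch)
            ]
            mean = sum(block) / len(block)  # area-averaged value of the block
            row.append(1 if mean > 0 else 0)
        coarse.append(row)
    return coarse

# A 4 x 4 pixel mask with one annotated pixel, reduced to 2 x 2 patch labels.
mask = [
    [0, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
coarse = downsample_mask(mask, patch=2)
```

Under the "> 0" rule, a patch is labeled class 1 as soon as any of its pixels is annotated, which favors recall of small target regions over strict majority labeling.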
We also visualize the token embedding spaces of our encoders for the image task in Figure S3. The lack of natural separation between class 1-specific tokens (thick × markers) and class 0 tokens (• markers) illustrates the difficulty of our task. In other words, class-specific regions are made up of tokens that are conceptually similar to non-region tokens.

D.5.3 GRAPHS (METALPDB)
We constructed a binary classification dataset of zinc-binding and non-binding proteins from the MetalPDB dataset (Putignano et al., 2018). We specifically focus on proteins annotated as enzymes, since metal ions are often critical for enzymatic activity. Such enzymes are known as metallo-enzymes, and our global classification labels reflect whether a metallo-enzyme relies on zinc or a different metal ion. For the positive set, we consider only biologically relevant zinc ions which occur within a chain (i.e., are bound to residues in the main chain of the protein, rather than ligand-binding or crystallization artifacts). We sample only one protein chain from each enzymatic class, as determined by Enzyme Commission numbers (Bairoch, 2000), selecting the structure with the best crystallographic resolution. This process resulted in 756 zinc-binding sites from 610 proteins, with 653 corresponding non-zinc-binding proteins sampled from unique enzymatic classes using the same procedure. For each zinc ion in the positive set, we extract all interacting residues annotated in MetalPDB to serve as our ground truth nodes for feature attribution. This dataset was split by enzyme class to ensure that no enzyme exists in both train and test sets, reserving 20% of chains for held-out evaluation. After removing four structures which produced embedding errors, this produced a training set of 1007 unique protein chains and a test set of 252. Each protein is featurized as a graph where each node represents a residue and edges are defined between residues which share any atom within a distance of ϵ angstroms, where ϵ varies the density of the graph.
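The graph featurization described above can be sketched as a distance-threshold rule over atom coordinates. The input format (a dictionary from residue identifiers to lists of 3D atom coordinates), the function name `residue_graph`, and the toy coordinates are assumptions made for this example; a quadratic all-pairs loop is used for clarity rather than efficiency.

```python
from itertools import combinations
from math import dist

def residue_graph(residues, epsilon):
    """Connect two residues if any pair of their atoms lies within epsilon.

    residues: dict mapping residue id to a list of (x, y, z) atom coordinates
    epsilon:  distance cutoff in angstroms
    Returns a set of undirected edges, each a frozenset of two residue ids.
    """
    edges = set()
    for a, b in combinations(residues, 2):
        close = any(
            dist(p, q) <= epsilon for p in residues[a] for q in residues[b]
        )
        if close:
            edges.add(frozenset({a, b}))
    return edges

# Toy coordinates (in angstroms) for three residues along the x-axis.
residues = {
    "R1": [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    "R2": [(4.0, 0.0, 0.0)],
    "R3": [(10.0, 0.0, 0.0)],
}
sparse_edges = residue_graph(residues, epsilon=3.5)
dense_edges = residue_graph(residues, epsilon=8.0)
```

Raising ϵ from 3.5 to 8.0 adds the R2-R3 edge, illustrating how ϵ controls the density of the resulting graph.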

D.6 Multi-class Settings
Prospectors can be easily adapted to the multi-class setting by training a separate model for each class of interest. For example, if faced with three classes a, b, c, prospectors could be applied in the following settings (class 1 and class 0, respectively):
• Prospector trained on a vs. {b, c}
• Prospector trained on b vs. {a, c}
• Prospector trained on c vs. {a, b}
In fact, both our protein (MetalPDB) and text (WikiSection) datasets are adapted from multi-class settings: MetalPDB contains data for many different metals, and WikiSection contains 27 different labels in the English disease document subset.
In each case, we selected one class to evaluate for simplicity (zinc-binding proteins and genetics-related text, respectively), but one could easily construct an analogous dataset and train a model for any other class label.
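The one-vs-rest construction above amounts to a relabeling of datum-level classes. The following sketch is illustrative, with hypothetical helper names; each resulting binary labeling would be used to fit its own prospector.

```python
def one_vs_rest_labels(labels, positive_class):
    """Map multi-class labels to binary labels: 1 for the chosen class, else 0."""
    return [1 if label == positive_class else 0 for label in labels]

def one_vs_rest_tasks(labels):
    """Build one binary labeling per class present in `labels`."""
    return {c: one_vs_rest_labels(labels, c) for c in sorted(set(labels))}

labels = ["a", "b", "c", "a"]
tasks = one_vs_rest_tasks(labels)  # one binary task per class a, b, c
```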

D.7 Pre-trained Encoders
We outline specific models and how to access them.

D.11 Robustness to Coarse Supervision
We also briefly study the robustness of prospector (and top baseline) test-set performance with respect to salient region characteristics: region prevalence and mean region dispersion. We display these results in Figures S4, S5, and S6. The closer the lines sit to the top of each plot, the more robust an encoder-attribution pipeline is to target region characteristics. Lines are created by convolving over the test-set examples.

D.11.1 ADDITIONAL ANALYSIS FOR METALPDB
The metal-binding protein task is particularly challenging as the majority of its class-specific regions are below 0.1 prevalence, but prospectors were nonetheless able to achieve high performance on most test-set examples (Figure S6). Interestingly, ESM2 showed bimodal performance, with high AUPRC on one subset and a correlated, low performance on another. This suggests that a subset of data does not contain clear sequence patterns that are correlated with zinc binding, while structure-based encoders can capture local interactions between residues far apart in sequence. In addition to the prevalence of class-specific regions, mean region dispersion provides a view into their spatial organization.

D.12 Domain-Specific Analysis of Prospector Internals
Department of Computer Science, Stanford University 4 Department of Electrical & Computer Engineering, University of Waterloo 5 Department of Radiology, Stanford University.Correspondence to: Gautam Machiraju <gmachi@stanford.edu>.Proceedings of the 41 st International Conference on Machine Learning, Vienna, Austria.PMLR 235, 2024.Copyright 2024 by the author(s).

Figure 1: Explanation-based attribution can be conceptualized as a "wrapper function" for trained classifiers using internals, forward or backward passes, or input perturbations. Prospector heads are instead encoder-equippable like classifier heads and adapt token embeddings with data- and time-efficiency. Flame icon indicates trainable parameters.

Figure 3: Prospector-equipped encoders produce attribution maps (called "prospect maps") over two layers. Details for fitting and inference are in Sections 3.3, 3.4, and C.3.

Figure 5: Layer (II) fitting and inference (K = 5). Concept frequencies are used to build sprite embeddings, which are used to fit a K2conv kernel. Flame icon indicates trainable parameters.
3.4.2 KERNEL FITTING (LAYER II)
Fitting the K2conv kernel involves computing the class-attribution weights for each monogram and skip-bigram in Z across the training set. These weights represent the only learnable parameters of a prospector head. The total number of parameters |Z| is thus dependent on K and is at maximum (Appendix C.3.2): $|Z| = 2K + \binom{K}{2}$.

Figure 6: Prospectors vs. baselines for sequences (left), images (middle), and graphs (right). Dots represent performance on individual test-set examples, while bars represent means with whiskers as standard errors. Numerical results are found in Appendix D.10.

Figure 7: Robustness analysis for Camelyon16 data: prospector and top baseline performance with respect to region characteristics.

Figure 8: Prospector visualization for pathology, using the top PLIP configuration (Table S4). Visualizations are shown for two test-set examples, from left to right: data sprites; sprite embeddings viewed as semantic networks and heatmaps (Appendix C.2), where line thickness or cell shade reflects monogram or skip-bigram count; the kernel viewed as a semantic network and heatmap, where line thickness and cell intensity reflect learned weights; prospect map, with vertex attribution scores mapped back onto tokens in original data; and ground-truth class-specific regions in the image (in red). Sprites and sprite embeddings are colored by the K = 15 learned concepts. Kernel weights and prospect maps are colored red and blue to reflect class 1-specific and class 0-specific associations, respectively.

Figure 9: Prospector visualization for protein data, using the top COLLAPSE configuration (Table S4). We show the same five visualizations as before, as well as a visualization of the atomic configuration of the zinc binding site to illustrate the binding motifs discovered by sprite embedding clustering (Figure S7). Whole proteins are visualized as cartoons instead of graphs for clarity.
Definition C.3 (Partial-context Encoder). Given a map graph $G$, an encoder $f$ is considered partial-context if it produces an embedding $x = f(v) \in \mathbb{R}^d$.
Definition C.4 (Full-context Encoder). Given a map graph $G$, an encoder $f$ is considered full-context if it produces embeddings $[x_1 \ldots x_T] = f(G)$, where $x_i \in \mathbb{R}^d$ for all $i = 1 \ldots T$.

Figure S1: Unstructured data represented as map graphs. Sequences (A) and images (B) require a specified resolution (e.g., words or sentences for text, pixels or patches for images) and connectivity (e.g., 2-hop, 8-way) for discretization.

Figure S2: A depiction of the rollup operation as seen in layer (II).

Figure S3: t-SNE plots, left-to-right then top-to-bottom: tile2vec, ViT, CLIP, PLIP. Marker color denotes concept, and marker type denotes the ground truth annotation for a patch: • for class 0, × for patches from class 1 images that do not originate from target regions, and a much thicker × for class 1-specific target regions.

Figure S4: Robustness for WikiSection data. Top baseline, MiniLM encoder with MLP head, is denoted by a black dashed line.

Figure S7 displays the results of hierarchically clustering sprite embeddings for the zinc binding task.

Figure S5: Robustness for Camelyon16 data. Top baseline, ViT encoder with prediction probabilities, is denoted by a black dashed line.

Figure S6: Robustness for MetalPDB data. Top baseline, COLLAPSE encoder with GAT head and GNNExplainer, is denoted by a black dashed line.

Figure S7: Hierarchical clustering results for sprite embeddings computed from high-precision test-set examples. X-axis labels are colored by the number of cysteine residues coordinating the central zinc ion, which is a key feature that is correlated with the resulting clusters: orange=3, red=4.

Table 1 (and Appendix D), baselines in Appendix D.9, and ruled-out baselines in Appendix B.
For both baselines and prospectors, we perform a grid search over tunable hyperparameters. Due to the MIA, the best models were selected based on their ability to localize ground truth class 1 regions in the training set, since these were not seen by prospectors during training. We use a sequential ranking criterion over four token-level metrics: precision, Dice coefficient, Matthews correlation coefficient, and AUPRC. Details of hyperparameter tuning and model selection are found in Appendix D.2 and D.3. The results in the remainder of this paper present the localization AUPRC and average precision (AP) over a set of thresholds, for class 1 regions in our held-out test data.
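To make the model selection step concrete, the sketch below computes three of the token-level metrics from binary predictions and ranks candidate configurations. It is a hedged illustration: interpreting "sequential ranking" as a lexicographic sort over (precision, Dice, MCC, AUPRC) is an assumption of this example, as are the function names, the configuration names, and the toy scores (AUPRC values are taken as given).

```python
from math import sqrt

def token_metrics(y_true, y_pred):
    """Precision, Dice coefficient, and MCC for binary token labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    dice = 2 * tp / (2 * tp + fp + fn) if 2 * tp + fp + fn else 0.0
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return precision, dice, mcc

def rank_configs(scores):
    """Order configurations best-first by their metric tuples.

    Tuples compare lexicographically, so later metrics only break ties
    among configurations that agree on all earlier metrics.
    """
    return sorted(scores, key=lambda name: scores[name], reverse=True)

precision, dice, mcc = token_metrics(y_true=[1, 1, 0, 0], y_pred=[1, 0, 0, 0])

# Hypothetical (precision, dice, mcc, auprc) tuples for three configurations.
scores = {
    "config_a": (0.9, 0.5, 0.4, 0.6),
    "config_b": (0.9, 0.7, 0.3, 0.5),
    "config_c": (0.8, 0.9, 0.9, 0.9),
}
ranked = rank_configs(scores)
```

In this toy example, `config_b` ranks first because it ties `config_a` on precision and wins on Dice, even though `config_c` is stronger on the later metrics.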

Table S4: Top prospectors per encoder, after model selection and sequential ranking. All selected prospectors except AA are parameter-free fold-change variants. * The AA encoder does not use clustering for quantization, since amino acids are already a discrete set of 21 tokens (20 standard amino acids + 1 entry for any non-standard amino acid). Symbol "-" denotes non-applicable hyperparameter.